Redundancy-aware unsupervised ranking based on game theory: Ranking pathways in collections of gene sets

In genetics, gene sets are grouped into collections according to their biological function. This often leads to high-dimensional, overlapping, and redundant families of sets, which precludes a straightforward interpretation of their biological meaning. In data mining, it is often argued that techniques to reduce the dimensionality of data could increase the maneuverability and consequently the interpretability of large data. Moreover, in recent years the machine learning and bioinformatics communities have become increasingly aware of the importance of understanding data and of interpretable models. On the one hand, there exist techniques that aggregate overlapping gene sets to create larger pathways. While these methods could partly solve the problem of the large size of the collections, modifying biological pathways is hardly justifiable in this biological context. On the other hand, the representation methods proposed so far to increase the interpretability of collections of gene sets have proved insufficient. Inspired by this bioinformatics context, we propose a method to rank sets within a family of sets based on the distribution of the singletons and their size. We obtain the sets' importance scores by computing Shapley values; by making use of microarray games, we do not incur the typical exponential computational complexity. Moreover, we address the challenge of constructing redundancy-aware rankings where, in our case, redundancy is a quantity proportional to the size of the intersections among the sets in the collections. We use the obtained rankings to reduce the dimension of the families, thereby obtaining lower redundancy among sets while still preserving a high coverage of their elements.
We finally evaluate our approach on collections of gene sets and apply Gene Set Enrichment Analysis techniques to the now smaller collections: as expected, the unsupervised nature of the proposed rankings leads to unremarkable differences in the number of significant gene sets for specific phenotypic traits. In contrast, the number of statistical tests performed can be drastically reduced. The proposed rankings show practical utility in bioinformatics for increasing the interpretability of collections of gene sets, and they are a step forward in including redundancy-awareness in Shapley value computations.

otherwise, for any $S \subseteq N$. It can be proved that, given any cooperative game $(N, v)$, the value function $v$ can be written in a unique way as a linear combination of unanimity games $u_T$, i.e.,

$$v(\cdot) = \sum_{T \in \mathcal{P}(N)} \lambda_T(v)\, u_T(\cdot),$$

where the $\lambda_T(v) \in \mathbb{R}$ are called unanimity coefficients and are determined by the formula

$$\lambda_T(v) = \sum_{S \subseteq T} (-1)^{|T| - |S|}\, v(S).$$

As we see, the computation of $\lambda_T(v)$, as well as that of $\phi_v(i)$, becomes intractable as $|N|$ increases. The SOUG representation allows for polynomial-time computation of Shapley values. In particular, the computation in terms of the unanimity coefficients $\lambda_T(v)$ reduces to

$$\phi_v(i) = \sum_{T \subseteq N,\ i \in T} \frac{\lambda_T(v)}{|T|}$$

for each player $i$ in $N$. Although every cooperative game $(N, v)$ has a unique formulation as a sum of unanimity games, finding the equivalent SOUG of a game $(N, v)$ is computationally as hard as computing the Shapley values.
Using SOUG brings the essential advantage of polynomial run-time when dealing with big families of sets, e.g., gene sets and pathways.
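The relation between unanimity coefficients and Shapley values can be illustrated with a minimal Python sketch. It assumes the standard inclusion-exclusion formula for the coefficients; the function names and the 3-player game are hypothetical and chosen only for illustration. The first function enumerates all coalitions and is therefore exponential in the number of players, whereas the second is linear in the number of non-zero coefficients.

```python
from fractions import Fraction
from itertools import combinations


def subsets(players):
    """Yield every subset of `players` as a frozenset."""
    players = list(players)
    for size in range(len(players) + 1):
        for combo in combinations(players, size):
            yield frozenset(combo)


def unanimity_coefficients(players, v):
    """Compute lambda_T(v) by inclusion-exclusion over the subsets of T.

    Exponential in the number of players: every coalition is visited.
    Only non-zero coefficients are stored.
    """
    coefficients = {}
    for T in subsets(players):
        if not T:
            continue
        total = sum((-1) ** (len(T) - len(S)) * v(S) for S in subsets(T))
        if total != 0:
            coefficients[T] = Fraction(total)
    return coefficients


def shapley_from_coefficients(players, coefficients):
    """Each coalition T shares lambda_T equally among its members."""
    values = {i: Fraction(0) for i in players}
    for T, lam in coefficients.items():
        for i in T:
            values[i] += lam / len(T)
    return values


# Hypothetical game (an assumption): a coalition is worth its size squared.
players = ["x", "y", "z"]
coefficients = unanimity_coefficients(players, lambda S: len(S) ** 2)
print(shapley_from_coefficients(players, coefficients))
```

When a game is already given as a short list of unanimity coefficients, only `shapley_from_coefficients` is needed, which is the situation exploited by microarray games.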

B Glove game
A classical example of a cooperative game is the so-called glove game. Consider the set of players $\{A, B, C\}$; A and B are right-hand gloves, while C is a left-hand glove. A coalition, i.e., a subset of $\{A, B, C\}$, has value 1 if it contains a pair of gloves (left + right) and value 0 otherwise. A person wears one pair of gloves at a time, so adding more gloves to a coalition that already contains a pair is useless; mathematically, a coalition containing a pair does not increase its worth when more gloves are included. The grand coalition $\{A, B, C\}$ contains one pair of gloves, i.e., the pair $\{A, C\}$ or the pair $\{B, C\}$, and therefore has value equal to 1. Note that the value function assigns 1 to the grand coalition and 0 to the empty set. After computing the Shapley values, we find $\phi(A) = \phi(B) = \frac{1}{6}$ and $\phi(C) = \frac{2}{3}$, which sum to the value of the grand coalition. Players A and B get the same Shapley value, as they are essentially indistinguishable. The Shapley values do not detect the existing redundancy between A and B: once one of A and B has been included, including the other does not yield any advantage. We refer to this similarity among players as redundancy, and we say that the Shapley values are unaware of redundancy among players.
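The Shapley values of the glove game can be checked by brute force. The following minimal sketch averages each player's marginal contribution over all arrival orders, which is the textbook definition of the Shapley value; the function and variable names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import permutations

RIGHT_GLOVES = {"A", "B"}
LEFT_GLOVES = {"C"}


def glove_value(coalition):
    """Return 1 if the coalition holds at least one left and one right glove."""
    coalition = set(coalition)
    return int(bool(coalition & RIGHT_GLOVES) and bool(coalition & LEFT_GLOVES))


def shapley_by_permutations(players, value):
    """Average each player's marginal contribution over all arrival orders."""
    totals = {p: Fraction(0) for p in players}
    orders = list(permutations(players))
    for order in orders:
        seen = []
        for p in order:
            before = value(seen)
            seen.append(p)
            totals[p] += value(seen) - before
    return {p: t / len(orders) for p, t in totals.items()}


glove_shapley = shapley_by_permutations(["A", "B", "C"], glove_value)
for player, phi in glove_shapley.items():
    print(player, phi)
```

Of the six arrival orders, C completes a pair in four, while A and B each complete a pair in exactly one, so the two right-hand gloves receive identical scores even though only one of them is ever needed.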

C Example of computation of Shapley values
We give here an example of how to compute the Shapley values in a toy setting. Consider a family of sets $F = \{P_1, P_2, P_3\}$ whose union is $G = \{a, b, c, d, e\}$. We construct the binary matrix $B$, where each column represents an element of $G$ (respectively $a, b, c, d, e$), each row represents a set of $F$, and the zero and one entries of $B$ encode whether the element is included in the set. We then get the dictionary $A$ as described in Equation (3) in Section 3.3. From the first column of $B$ it follows that $a$ belongs to $P_1$ and $P_2$ but not to $P_3$; thus, the set $\{P_1, P_2\}$ is the first set in $A$. Applying the same procedure to each column of $B$ yields the remaining sets of $A$. We then calculate the Shapley values as in Equation (4).
As expected, we observe that the sum of the Shapley values equals 1. Moreover, we notice a bias towards larger sets: the ordering $P_1$, $P_3$, $P_2$ also reflects the ordering of the sets by size. This example illustrates that Shapley values do not aim for low intersection among ranked sets or for high coverage. The sets in $F$ are ordered according to their Shapley values as $P_1$, $P_3$, $P_2$. However, $P_1$ and $P_3$ share two elements, while $P_1$ and $P_2$ share only one. If we want to select two out of these three sets so as to maximize the coverage of $G$ while keeping the overlap among the sets low, we would do better to select $P_1$ and $P_2$ instead of $P_1$ and $P_3$.
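The procedure can be sketched in Python under two explicit assumptions. First, the family below is hypothetical: it is not the family used in the example, but one chosen to satisfy the stated properties (union $\{a, b, c, d, e\}$, $a$ in $P_1$ and $P_2$ only, $P_1$ and $P_3$ sharing two elements, $P_1$ and $P_2$ sharing one). Second, the Shapley values are assumed to take the usual microarray-game form, in which every element splits one unit equally among the sets containing it and the totals are divided by the number of elements.

```python
from fractions import Fraction

# Hypothetical family of sets (an assumption, not the family from the text).
FAMILY = {
    "P1": {"a", "b", "c", "d"},
    "P2": {"a", "e"},
    "P3": {"b", "c", "e"},
}


def membership_dictionary(family):
    """Map each element g to the names of the sets that contain g."""
    memberships = {}
    for name, members in family.items():
        for g in members:
            memberships.setdefault(g, set()).add(name)
    return memberships


def microarray_shapley(family):
    """Assumed microarray-game form of the Shapley values.

    Each element splits one unit equally among the sets containing it;
    the totals are then divided by the number of elements.
    """
    memberships = membership_dictionary(family)
    values = {name: Fraction(0) for name in family}
    for containing in memberships.values():
        for name in containing:
            values[name] += Fraction(1, len(containing))
    return {name: total / len(memberships) for name, total in values.items()}


shapley = microarray_shapley(FAMILY)
ranking = sorted(shapley, key=shapley.get, reverse=True)
print(shapley)
print(ranking)
```

For this hypothetical family the values are 1/2, 1/5 and 3/10 for `P1`, `P2` and `P3`, so the ranking is `P1`, `P3`, `P2` and the values sum to 1. The cost is linear in the total size of the sets, with no enumeration of coalitions.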

D Different rankings
In the paper, we introduced four different orderings, which depend on different penalization criteria. The difference among the penalization criteria is reflected in the coverage and redundancy-reduction performance of the various orderings. For the sake of completeness and reproducibility, we give here detailed information on the algorithm.
The Shapley values need to be re-computed after each iteration because, after the selection of some pathways, they no longer sum to 1. Moreover, the removal of one set can change the ordering of the other sets, as the Shapley values depend on the distribution of the elements among the sets.
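The select-then-recompute loop can be sketched as follows. This is an illustration under stated assumptions, not the exact scoring used by any of the four orderings: the score of an unranked set is assumed to be its current Shapley value (in the assumed microarray-game form, re-computed on the remaining sets only) minus the sum of its Jaccard indices with all sets ranked so far. All function names and the example family are hypothetical.

```python
from fractions import Fraction


def microarray_shapley(family):
    """Assumed microarray-game Shapley values of the sets in `family`."""
    memberships = {}
    for name, members in family.items():
        for g in members:
            memberships.setdefault(g, set()).add(name)
    values = {name: Fraction(0) for name in family}
    for containing in memberships.values():
        for name in containing:
            values[name] += Fraction(1, len(containing))
    size = max(len(memberships), 1)  # guard against a family of empty sets
    return {name: total / size for name, total in values.items()}


def jaccard(first, second):
    """Jaccard index |A & B| / |A | B| (0 for two empty sets)."""
    union = first | second
    return Fraction(len(first & second), len(union)) if union else Fraction(0)


def penalized_ranking(family):
    """Greedy ranking with re-computed Shapley values and cumulative penalties.

    ASSUMPTION: score = current Shapley value minus the sum of the Jaccard
    indices with all previously ranked sets.
    """
    remaining = dict(family)
    penalty = {name: Fraction(0) for name in family}
    ranking = []
    while remaining:
        shapley = microarray_shapley(remaining)  # re-computed every iteration
        scores = {name: shapley[name] - penalty[name] for name in remaining}
        # Ties are broken by name so that the result is deterministic.
        best = max(sorted(remaining), key=scores.get)
        ranking.append(best)
        selected = remaining.pop(best)
        for name, members in remaining.items():
            penalty[name] += jaccard(members, selected)
    return ranking


# Hypothetical family: P3 overlaps heavily with P1, while P2 is disjoint.
example = {
    "P1": {"a", "b", "c", "d", "e"},
    "P2": {"g", "h"},
    "P3": {"a", "b", "c", "f"},
}
print(penalized_ranking(example))
```

In this hypothetical family, plain Shapley values rank `P3` above `P2`, but once `P1` is selected the overlap penalty moves the disjoint set `P2` ahead of `P3`.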
We give here more details on the construction of the various penalized rankings in the context of pathway orderings. The details introduced here hold in general for any family of sets.
• Penalized Ordering (PO): The penalty grows at each step, as the Jaccard index with the last ranked pathway is added to it. The score $S_{n+1}(P)$ of a pathway $P$ that has not yet been ranked at step $n+1$ is defined iteratively from this penalty and the Shapley value and, at step $n+1$, the algorithm ranks the set $\hat{P}_{n+1} = \arg\max_P S_{n+1}(P)$. Here, $\phi_n(\cdot)$ denotes the Shapley value function at the $n$-th iteration.
• Penalized Ordering Re-scaled (POR): The algorithm is very similar to PO. The only difference lies in the penalty, which in POR is re-scaled to the interval $[0, \max_P \{\phi_i(P)\}]$. This prevents the highest importance scores from becoming negative after the first $n$ steps, which can happen in PO because its penalties can exceed 1.
The importance scores are otherwise defined as in PO, and the complexity of the algorithm does not change.
• Artificial Ordering (AO): After ranking the first pathway, the algorithm computes an artificial pathway that includes all genes belonging to at least one of the previously selected pathways.
In order to rank the $(n+1)$-th set, PO (and POR) penalize multiple times the overlap among sets that share some element $g$. The introduction of the artificial gene set $AP_n$, i.e., $AP_n = \bigcup_{i=0}^{n} \arg\max_P S_i(P)$, avoids this repeated penalization.
• Artificial Ordering Re-scaled (AOR): The difference from the AO ordering is the re-scaling of the penalty to the interval $[0, \max_P \{\phi_i(P)\}]$.
The first score is $S_1(P) = \phi_1(P)$. As in POR, the complexity of the algorithm does not change.

Table 3. Comparison of the number of significant pathways. Number of significant pathways found using Fisher's exact test and FDR correction when using 40% of the pathways in the collection of gene sets, ranked using the proposed orderings (namely SV, PO, POR, AO and AOR), and when using the complete collection of gene sets (ALL); the ENR row contains the number of significant pathways for each of the phenotypic traits using the Enrich method.